Fixed value multiplication using field-programmable gate array

ABSTRACT

A method for multiplying two binary numbers includes configuring, in an integrated circuit, a plurality of lookup tables based on a known binary number (w). The lookup tables can be configured in three layers. The method further includes receiving, by the integrated circuit, an input binary number (d). The method further includes determining, by the integrated circuit, a multiplication result (p) of the known binary number w and the input binary number d by determining each bit (p_(i)) of p using the lookup tables based on specific combinations of bits from the known binary number w and from the input binary number d, wherein a notation j_(x) represents the x^(th) bit of j from the right, with bit j₀ being the rightmost bit of j.

BACKGROUND

The present invention relates to computing technology, and particularly to improvements to a fixed-value multiplier used by computing systems, where the improvement is achieved by using a field-programmable gate array (FPGA).

An FPGA is an integrated circuit designed to be configured by a customer or a designer after manufacturing. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC). Computing a multiplication of two numbers is a common operation performed by various computing systems. Several computing systems require a multiplication in which one of the values is known or fixed, and the second is a dynamic input value.

SUMMARY

According to one or more embodiments of the present invention, a method for multiplying two binary numbers includes configuring, in an integrated circuit, a plurality of lookup tables based on a known binary number (w). The lookup tables can be configured in three layers. The method further includes receiving, by the integrated circuit, an input binary number (d). The method further includes determining, by the integrated circuit, a multiplication result (p) of the known binary number w and the input binary number d by determining each bit (p_(i)) of p using the lookup tables based on specific combinations of bits from the known binary number w and from the input binary number d, wherein a notation j_(x) represents the x^(th) bit of j from the right, with bit j₀ being the rightmost bit of j.

In one or more embodiments of the present invention, the known binary number w has a predetermined number of bits. For example, the known binary number is an 8-bit binary number. Further, in one or more embodiments of the present invention, the input binary number has a predetermined number of bits. For example, the input binary number is a 12-bit binary number.

In one or more embodiments of the present invention, the bits p₅, p₄, p₃, p₂, p₁, and p₀ of p are determined by a first circuit that includes a first layer of the lookup tables from the integrated circuit based on the bits d₅, d₄, d₃, d₂, d₁, and d₀ of the input binary number d. Further, the bits p₈, p₇, and p₆ of p are determined by a second circuit from the integrated circuit based on a first set of auxiliary bits computed by the first circuit. The second circuit includes a second layer of the lookup tables. Further yet, the bits p₁₆, p₁₅, p₁₄, p₁₃, p₁₂, p₁₁, and p₁₀ of p are determined by a third circuit from the integrated circuit based on auxiliary bits computed by the second circuit. The third circuit includes a third layer of the lookup tables. In one or more embodiments of the present invention, determining the bit p₁₉ of p includes determining, using a subset of lookup tables, whether t≤d, wherein t=⌈2¹⁹/w⌉ is precomputed; in response to t≤d, p₁₉ is set to 1, and otherwise p₁₉ is set to 0.

In one or more embodiments of the present invention, determining the bitp₁₈ of p includes precomputing threshold values t₀₁, t₁₀, and t₁₁:t ₀₁=┌2¹⁸ /w┐,t ₁₀=└(2¹⁹−1)/w┘, andt ₁₁=┌(2¹⁹+2¹⁸)/w┐.

In response to (t₁₁≤d) or (t₀₁≤d≤t₁₀), p₁₈ is set to 1, and otherwise to0.

Further, in one or more embodiments of the present invention, determining the bit p₁₇ of p includes precomputing threshold values:

t₀₀₁=⌈2¹⁷/w⌉, t₀₁₀=⌊(2¹⁸−1)/w⌋, t₀₁₁=⌈(2¹⁸+2¹⁷)/w⌉, t₁₀₀=⌊(2¹⁹−1)/w⌋, t₁₀₁=⌈(2¹⁹+2¹⁷)/w⌉, t₁₁₀=⌊(2¹⁹+2¹⁸−1)/w⌋, and t₁₁₁=⌈(2¹⁹+2¹⁸+2¹⁷)/w⌉.

The bit p₁₇ is set to 1 in response to any one of t₁₁₁≤d, t₁₀₁≤d≤t₁₁₀, t₀₁₁≤d≤t₁₀₀, or t₀₀₁≤d≤t₀₁₀, and to 0 otherwise.
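The threshold tests for p₁₉, p₁₈, and p₁₇ can be checked with a short software model. The following Python sketch is illustrative only and is not the claimed hardware: the helper names `ceil_div` and `top_bits_by_thresholds` are hypothetical, and the model assumes the exemplary widths of a nonzero 8-bit w and a 12-bit d, with the interval conditions for p₁₇ combined by a logical OR.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers (hypothetical helper)."""
    return -(-a // b)


def top_bits_by_thresholds(w, d):
    """Return (p19, p18, p17) of w * d using only comparisons of d against
    thresholds that depend on the fixed value w alone."""
    # p19 is 1 exactly when w * d >= 2**19.
    t = ceil_div(2**19, w)
    p19 = int(t <= d)

    # p18 is 1 when w * d lies in [2**18, 2**19 - 1] or in [2**19 + 2**18, ...].
    t01 = ceil_div(2**18, w)
    t10 = (2**19 - 1) // w
    t11 = ceil_div(2**19 + 2**18, w)
    p18 = int(t11 <= d or t01 <= d <= t10)

    # p17 is 1 when w * d lies in one of four intervals.
    t001 = ceil_div(2**17, w)
    t010 = (2**18 - 1) // w
    t011 = ceil_div(2**18 + 2**17, w)
    t100 = (2**19 - 1) // w
    t101 = ceil_div(2**19 + 2**17, w)
    t110 = (2**19 + 2**18 - 1) // w
    t111 = ceil_div(2**19 + 2**18 + 2**17, w)
    p17 = int(
        t111 <= d
        or t101 <= d <= t110
        or t011 <= d <= t100
        or t001 <= d <= t010
    )
    return p19, p18, p17
```

Because every threshold depends only on w, all of them can be computed once when the lookup tables are configured; at runtime only comparisons against d remain.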

The technical solutions described herein can also be achieved by implementing a system that includes a memory device that stores a known binary number (w), and a multiplication circuit that performs the method to determine the multiplication result (p) of the known binary number with an input binary number (d) that is received dynamically.

Alternatively, in one or more embodiments of the present invention, a neural network system includes a multiplication circuit for performing a method to determine a multiplication result of a weight value with an input value (d) that is received dynamically, the method including configuring several lookup tables in an integrated circuit based on the weight value (w) that is a known value. The lookup tables can be configured in three layers. The method further includes determining a multiplication result (p) of the weight value w and the input value d by determining each bit (p_(i)) of p using the lookup tables based on a specific combination of bits from the weight value w and from the input value d, wherein a notation j_(x) represents the x^(th) bit of j from the right, with bit j₀ being the rightmost bit of j.

In yet another embodiment of the present invention, an electronic circuit determines a multiplication result (p) of a weight value (w) and an input value (d) that is received dynamically. Determining the multiplication result includes configuring several lookup tables based on the weight value (w), and determining each respective bit (p_(i)) of the multiplication result (p) using the lookup tables based on a specific combination of bits from the weight value w and from the input value d. The notation j_(x) represents the x^(th) bit of j from the right, with bit j₀ being the rightmost bit of j.

In another embodiment of the present invention, a field programmable gate array includes several lookup tables, wherein the field programmable gate array performs a method for determining a multiplication result (p) of a weight value (w) and an input value (d) that is received dynamically. The lookup tables can be configured in three layers.

Embodiments of the present invention can include various other implementations such as machines, devices, and apparatus.

Additional technical features and benefits are realized through the techniques of the present invention. Embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed subject matter. For a better understanding, refer to the detailed description and to the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 depicts a block diagram of a system that uses a multiplication circuit according to one or more embodiments of the present invention;

FIG. 2 depicts an exemplary neural network system according to one or more embodiments of the present invention;

FIG. 3 depicts a block diagram of a multiplication circuit according to one or more embodiments of the present invention;

FIG. 4 depicts a flowchart of a method for determining a multiplication result of a known value, w, with input value, d, according to one or more embodiments of the present invention;

FIG. 5 depicts a first circuit used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 6 depicts a second circuit used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 7 depicts a third circuit used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 8 depicts a lookup table used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 9 depicts several lookup tables used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 10 depicts a lookup table used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 11 depicts a lookup table used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 12 depicts a lookup table used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 13 depicts a circuit used to determine a most significant bit of a multiplication result according to one or more embodiments of the present invention;

FIG. 14 depicts a circuit used to determine a second most significant bit of a multiplication result according to one or more embodiments of the present invention;

FIG. 15 depicts several lookup tables used in a multiplication circuit according to one or more embodiments of the present invention;

FIG. 16 depicts a circuit used to determine a third most significant bit of a multiplication result according to one or more embodiments of the present invention; and

FIG. 17 depicts a computing system that can be used to implement one or more embodiments of the present invention.

The diagrams depicted herein are illustrative. There can be many variations to the diagrams or the operations described therein without departing from the spirit of the invention. For instance, the actions can be performed in a differing order, or actions can be added, deleted, or modified. Also, the term “coupled” and variations thereof describe having a communications path between two elements and do not imply a direct connection between the elements with no intervening elements/connections between them. All of these variations are considered a part of the specification.

DETAILED DESCRIPTION

Exemplary embodiments of the present invention provide improved efficiency for computing a multiplication in computing systems, particularly in the case where one of the values to be multiplied is a fixed (known) value, and the second value to be multiplied is dynamically input. Exemplary embodiments of the present invention provide a multiplication circuit for performing such a computation efficiently. The values are represented in digital format using binary numbers. In one or more embodiments of the present invention, a field-programmable gate array (FPGA) includes a plurality of lookup tables (LUTs), the LUTs being configured in n layers to realize the multiplication circuit.

As a brief introduction, FPGAs typically contain an array of programmable logic blocks and a hierarchy of reconfigurable interconnects that allow the blocks to be “wired together,” like several logic gates that can be coupled together in different configurations. Logic blocks can be configured to perform combinational functions, logic gates like AND and XOR, and other functions. The logic blocks also include memory elements, such as flip-flops or more complete blocks of memory. It should be noted that FPGAs can include components different from those described herein; the above is an exemplary FPGA.

A technical challenge with computing systems is improving the time required for the computing system to perform calculations such as multiplication of numbers represented as binary numbers. Technical solutions provided by embodiments of the present invention address such technical challenges by providing an n-layered multiplication circuit that performs a multiplication in deterministic time for two binary numbers: one fixed-value number and one variable number that is input at runtime. One or more embodiments of the present invention use FPGAs to implement the multiplication circuit using LUTs. As used herein, a “k-to-1 Boolean function” is implemented as a k-input LUT that provides a 1-bit output given a k-bit input.
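As an illustration of the k-to-1 Boolean function terminology, a k-input LUT can be modeled in software as a precomputed truth table with 2^k one-bit entries. The Python sketch below is only a model under that assumption; the names `make_lut`, `lut_eval`, and `parity6` are hypothetical and do not correspond to any FPGA vendor interface.

```python
def make_lut(func, k=6):
    """Precompute the truth table of a k-to-1 Boolean function.

    func maps an integer in [0, 2**k) to a value whose lowest bit is the
    output bit. The returned list has 2**k one-bit entries."""
    return [func(x) & 1 for x in range(2**k)]


def lut_eval(table, bits):
    """Evaluate a LUT; bits[0] is the least significant input bit."""
    index = sum(b << i for i, b in enumerate(bits))
    return table[index]


# Example: a 6-to-1 function returning the parity of its six input bits.
parity6 = make_lut(lambda x: bin(x).count("1"))
```

Configuring a LUT for a fixed weight-value then amounts to filling such a table once; evaluating it at runtime is a single indexed read.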

Further, the present document denotes B={0,1}, B^(n) is the set of all n-tuples of zeros and ones, and B_(n) is the set of all Boolean functions B^(n)→B. Also, the present document uses the same symbol x interchangeably to denote both a Boolean vector x=(x₀, . . . , x_(k-1))∈B^(k) and a natural number

$x = \sum\limits_{i = 0}^{k - 1} x_{i} 2^{i}.$

The operation of the addition of two Boolean vectors, including two Boolean scalars, is denoted without confusion by the same ‘+’ sign. Accordingly, the technical challenge restated using the terminology just established is to compute the product of an input value d and a fixed-value weight w.

FIG. 1 depicts a block diagram of a system for computing a fixed-value binary number multiplication according to one or more embodiments of the present invention. For example, the multiplication circuit 115 that is described herein can be a part of a computing system 110 that receives an input-value (d) from an input source 120. The multiplication circuit 115 calculates an output p of a product of d with a known fixed value w.

In one or more embodiments of the present invention, the computing system 110 can be an artificial neural network system. Alternatively, the computing system 110 is a desktop computer, a server computer, a tablet computer, or any other type of computing device that uses the multiplication circuit 115 to compute a product of two binary numbers, one of which has a known value (w).

In one or more embodiments of the present invention, the input source 120 can be a memory or a storage device from which the input-value is provided to the multiplication circuit 115. Alternatively, or in addition, the input-value can be input to the multiplication circuit 115 directly upon acquisition; for example, the input source 120 is a sensor, such as a camera, an audio input device, or any other type of sensor that captures data in a form that can be input to the multiplication circuit 115.

FIG. 2 illustrates an example, non-limiting neural network system for which efficiency can be facilitated in accordance with one or more embodiments of the invention. The neurons of a neural network 200 can be connected so that the output of one neuron can serve as an input to another neuron. Neurons within a neural network can be organized into layers, as shown in FIG. 2. The first layer of a neural network can be called the input layer (224), the last layer of a neural network can be called the output layer (228), and any intervening layers of a neural network can be called a hidden layer (226). Aspects of systems (e.g., system 200 and the like), apparatuses, or processes explained herein can constitute machine-executable component(s) embodied within machine(s), e.g., embodied in one or more computer-readable mediums (or media) associated with one or more machines. Such component(s), when executed by the one or more machines, e.g., computer(s), computing device(s), virtual machine(s), etc. can cause the machine(s) to perform the operations described. Repetitive description of like elements employed in respective embodiments is omitted for the sake of brevity.

The system 200 and/or the components of the system 200 can be employed to use hardware and/or software to solve problems that are highly technical in nature, that are not abstract and that cannot be performed as a set of mental acts by a human. For example, system 200 and/or the components of the system 200 can be employed to use hardware and/or software to perform operations, including facilitating an efficiency within a neural network. Furthermore, some of the processes performed can be performed by specialized computers for carrying out defined tasks related to facilitating efficiency within a neural network. System 200 and/or components of the system 200 can be employed to solve new problems that arise through advancements in technology, computer networks, the Internet, and the like. System 200 can further provide technical improvements to live and Internet-based learning systems by improving processing efficiency among processing components associated with facilitating efficiency within a neural network.

System 200, as depicted in FIG. 2, is a neural network that includes five neurons: neuron 202, neuron 204, neuron 206, neuron 208, and neuron 210. The input layer 224 of this neural network is comprised of neuron 202 and neuron 204. The hidden layer 226 of this neural network is comprised of neuron 206 and neuron 208. The output layer 228 of this neural network is comprised of neuron 210. Each of the neurons of input layer 224 is connected to each of the neurons of hidden layer 226. That is, a possibly-weighted output of each neuron of input layer 224 is used as an input to each neuron of hidden layer 226. Then, each of the neurons of hidden layer 226 is connected to each of the neurons (here, one neuron) of output layer 228.

The neural network of system 200 presents a simplified example so that certain features can be emphasized for clarity. It can be appreciated that the present techniques can be applied to other neural networks, including ones that are significantly more complex than the neural network of system 200.

In the context of artificial neural networks, each of the neurons performs a computation, for example, during various phases, such as forward propagation, backward propagation, and weight update. Such computations can include multiplication. In one or more embodiments of the present invention, the computations include multiplication of a weight-value assigned to the neuron (which is a known and fixed value), and an input value (which can be variable). Here, the “weight-value” represents the weight that is assigned to a neuron in the neural network 200, and the “input-value” is a value received by that neuron to calculate the output. The calculation can be performed during the training of the neural network or during inference using the neural network. In one or more embodiments of the present invention, the calculation can be performed during any phase of the training, forward propagation, backward propagation, weight update, or any other phase. The performance of the neural network 200 can be improved if the efficiency of the multiplication operation can be improved. One or more embodiments of the present invention facilitate a faster way of calculating a multiplication of an input-value (d) with the weight-value (w). Further, one or more embodiments of the present invention facilitate hardware components to support such calculation using LUTs.

It is noted that although FIG. 2 depicts an embodiment of the present invention with the computing system 110 as a neural network system (200), in one or more embodiments of the present invention, the computing system 110 can be other types of computing systems that include the multiplication circuit 115.

The technical solutions provided by one or more embodiments of the present invention are now described using an exemplary case where w is an 8-bit value, and d is a 12-bit value. It is understood that in other embodiments of the present invention, the values can have a different number of bits. However, for explaining the operation of the technical solutions of the present invention, the above example scenario is chosen. Accordingly, the computational problem is defined by a fixed nonzero weight-value, which is a vector of bit-values w=(w₀, . . . , w₇)∈B⁸, so

$w = \sum\limits_{i = 0}^{7} w_{i} 2^{i}.$

The input-value is a vector of bit-values d=(d₀, . . . , d₁₁)∈B¹², so

$d = \sum\limits_{i = 0}^{11} d_{i} 2^{i}.$

The input-value d can also be represented as d=g·2⁶+h, where g and h are integers such that 0≤g, h<2⁶. Accordingly,

$g = \left\lfloor \frac{d}{2^{6}} \right\rfloor$

and h=d−g·2⁶.

In this case, p is a vector of bit-values such that p=(p₀, . . . , p₁₉)∈B²⁰, so

$p = \sum\limits_{i = 0}^{19} p_{i} 2^{i},$

and hence, 0≤p≤2²⁰−1. In this document, a function is denoted P_(i)(d)=P_(i)(d; w), where the function returns p_(i), i=0, . . . , 19.

Now, if n is any natural number, and if g is an n-to-1 Boolean function, and if ƒ₁, . . . , ƒ_(n) are n 6-to-1 Boolean functions, the composition h(x)=g(ƒ₁(x), . . . , ƒ_(n)(x)) is also a 6-to-1 Boolean function. Further, let x∈{0, . . . , 2⁶−1} and let y=x+1, where x and y are Boolean vectors. For i=0, . . . , 5:

$J_{i}(x) = \begin{cases} 1 & \text{if } x_{j} = 1 \text{ for } j = 0, 1, \ldots, i \\ 0 & \text{otherwise} \end{cases} \qquad (1)$

Under these conditions, for i=0, . . . , 5, y_(i)=x_(i) if and only if J_(i)(x)=0, and y₆=1 if and only if J₅(x)=1.

It can be proven that for every natural number n∈ℕ, there exists a 2-level circuit of 6-to-1 Boolean functions, where the circuit decides for every d∈B¹² whether or not d<n. If n≥2¹², then because d<2¹², the problem is trivial. Hence, consider that n<2¹². In this case, the number n can be uniquely represented as n=a·2⁶+b, where a and b are integers such that 0≤a, b<2⁶. Here, d<n if and only if either (i) g<a, or (ii) g=a and h<b. Define the following functions:

$G(d) = G(g) = \begin{cases} 1 & \text{if } g < a \\ 0 & \text{if } g \geq a \end{cases},$

$E(d) = E(g) = \begin{cases} 1 & \text{if } g = a \\ 0 & \text{if } g \neq a \end{cases},$

and

$H(d) = H(h) = \begin{cases} 1 & \text{if } h < b \\ 0 & \text{if } h \geq b \end{cases}.$

Each of these functions is a 6-to-1 Boolean function. It should be noted that although d∈B¹², each of these functions operates on only six bits of d. Let J: B³→B be the following:

$J(x_{1}, x_{2}, x_{3}) = \begin{cases} 1 & \text{if } x_{1} = 1 \\ 1 & \text{if } x_{2} = 1 \text{ and } x_{3} = 1 \\ 0 & \text{if } x_{2} = 1 \text{ and } x_{3} = 0 \\ 0 & \text{if } x_{1} = 0 \text{ and } x_{2} = 0 \end{cases}.$

It follows that d<n ⇔ J[G(d), E(d), H(d)]=1.
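The two-level comparison can be modeled in software to confirm that G, E, H, and J together decide d<n. The Python sketch below is an illustrative model, not the hardware itself: the name `make_less_than` is hypothetical, the three first-level functions are stored as 64-entry tables to mirror 6-input LUTs, and n<2¹² is assumed as in the non-trivial case above.

```python
def make_less_than(n):
    """Build a two-level comparator that decides d < n for a 12-bit d.

    Assumes 0 <= n < 2**12. The first level holds three 64-entry tables
    (G, E, H); the second level is the 3-input function J."""
    a, b = n >> 6, n & 0x3F                   # n = a * 2**6 + b

    G = [int(g < a) for g in range(64)]       # 1 if g < a
    E = [int(g == a) for g in range(64)]      # 1 if g == a
    H = [int(h < b) for h in range(64)]       # 1 if h < b

    def J(x1, x2, x3):
        # 1 if x1 = 1, or if x2 = 1 and x3 = 1; 0 otherwise.
        return int(x1 == 1 or (x2 == 1 and x3 == 1))

    def less_than(d):
        g, h = d >> 6, d & 0x3F               # d = g * 2**6 + h
        return J(G[g], E[g], H[h])

    return less_than
```

For example, `make_less_than(1000)(999)` evaluates to 1 and `make_less_than(1000)(1000)` evaluates to 0. The tables depend only on n, so they can be fixed at configuration time.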

Accordingly, to improve the operation of the computing system 110, the multiplication circuit 115 has to be a circuit of FPGAs of minimum possible depth (i.e., number of layers) so that given a fixed weight-value w, the LUTs can be programmed for calculating the product w×d for any input d∈B¹²:

$\quad\begin{matrix}\; & \; & \; & \; & \; & \; & \; & \; & d_{11} & d_{10} & d_{9} & d_{8} & d_{7} & d_{6} & d_{5} & d_{4} & d_{3} & d_{2} & d_{1} & d_{0} \\\; & \; & \; & \; & \; & \; & \; & \; & \; & \; & \times & \; & w_{7} & w_{6} & w_{5} & w_{4} & w_{3} & w_{2} & w_{1} & w_{0} \\p_{19} & p_{18} & p_{17} & p_{16} & p_{15} & p_{14} & p_{13} & p_{12} & p_{11} & p_{10} & p_{9} & p_{8} & p_{7} & p_{6} & p_{5} & p_{4} & p_{3} & p_{2} & p_{1} & p_{0}\end{matrix}$

Here, each of the d_(i), w_(i), and p_(i) is a bit-value.

FIG. 3 depicts a block diagram of the multiplication circuit according to one or more embodiments of the present invention. The multiplication circuit 115 can be an FPGA in one or more embodiments of the present invention. Alternatively, or in addition, the multiplication circuit 115 can be an application-specific integrated circuit (ASIC), or any other type of electronic circuit that includes transistors, flip-flops, and other such components to implement lookup tables. The multiplication circuit 115, for this particular case of 8-bit w and 12-bit d, includes three levels, a first circuit 310, a second circuit 320, and a third circuit 330, each level being a separate circuit. Here, each circuit can be a collection of LUTs implemented by the FPGA. In the case where the w and/or the d has/have a different number of bits, the number of levels in the multiplication circuit 115 can change. The circuits (310, 320, and 330) can communicate auxiliary bit-values among one another to make the calculations efficient. The circuits (310, 320, and 330) include several LUTs 350 to provide an output based on the input-value d. The LUTs 350 are configured based on the weight-value w, which is a known value.

FIG. 4 depicts a flowchart of a method for computing the product p of the input-value d and the weight-value w according to one or more embodiments of the present invention. FIG. 5, FIG. 6, and FIG. 7 depict block diagrams of the first circuit 310, the second circuit 320, and the third circuit 330, respectively, according to one or more embodiments of the present invention.

The method 400 includes configuring the several LUTs 350 in the multiplication circuit 115 based on the weight-value w, at block 402. Each LUT 350 provides an output bit based on a set of input bits. The input bits to the LUT 350 can include one or more bit-values from d. In addition, the input bits to the LUT 350 can include one or more auxiliary bit-values output by another LUT 350. In one or more cases, the auxiliary bit-values from the first circuit 310 are used as input bits to one or more LUTs 350 in the other circuits (320 and 330). Similarly, auxiliary bits from the second circuit 320 can be used as input bits to LUTs 350 of the third circuit 330.

The method 400 includes determining an output of the first circuit 310, at block 410. Part of the output from the first circuit 310 includes a predetermined number of LSBs 510 of p, at block 412. As shown in FIG. 5, in the case with the 12-bit d and 8-bit w, the first circuit 310 computes the six LSBs of p (p₀ to p₅). The first circuit 310 further determines a first set of auxiliary bit-values (r₆ to r₁₃) 520 that is used as input bit-values to other LUTs 350 from the multiplication circuit 115, at block 414. The LSBs 510 and the first set of auxiliary bit-values 520 are both determined based on the six LSBs of d (d₀ to d₅). The computation of the LSBs 510 and the first set of auxiliary bit-values 520 can be expressed as:

$\begin{matrix}{\quad\begin{matrix}\; & \; & \; & \; & \; & \; & \; & \; & \; & d_{5} & d_{4} & d_{3} & d_{2} & d_{1} & d_{0} \\\; & \; & \; & \; & \; & \times & \; & w_{7} & w_{6} & w_{5} & w_{4} & w_{3} & w_{2} & w_{1} & w_{0} \\ = & r_{13} & r_{12} & r_{11} & r_{10} & r_{9} & r_{8} & r_{7} & r_{6} & p_{5} & p_{4} & p_{3} & p_{2} & p_{1} & p_{0}\end{matrix}} & (2)\end{matrix}$

Further, the first circuit 310 determines a second set of auxiliary bit-values (q₀ to q₁₃) 530 that is used as input bit-values to other LUTs 350 from the multiplication circuit 115, at block 416. The second set of auxiliary bit-values 530 is determined based on the six most significant bits (MSBs) of d (d₆ to d₁₁). The computation of the second set of auxiliary bit-values 530 can be expressed as:

$\begin{matrix}{\quad\begin{matrix}\; & \; & \; & \; & \; & \; & \; & \; & \; & d_{11} & d_{10} & d_{9} & d_{8} & d_{7} & d_{6} \\\; & \; & \; & \; & \; & \times & \; & w_{7} & w_{6} & w_{5} & w_{4} & w_{3} & w_{2} & w_{1} & w_{0} \\ = & q_{13} & q_{12} & q_{11} & q_{10} & q_{9} & q_{8} & q_{7} & q_{6} & q_{5} & q_{4} & q_{3} & q_{2} & q_{1} & q_{0}\end{matrix}} & (3)\end{matrix}$

In addition, the first circuit 310 determines a first ancillary bit-value (q_(9 . . . 8)) 540, at block 418. The first ancillary bit-value is communicated to the second circuit 320 and represents:

q_(9 . . . 8)=q₉∧q₈
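The outputs of the first circuit can be modeled arithmetically to check equations (2) and (3). The Python sketch below is a behavioral model only: the function name `first_circuit` and its return layout are hypothetical, and the model uses integer multiplication where the hardware would use one 6-input LUT per output bit, configured for the fixed w.

```python
def first_circuit(w, d):
    """Behavioral model of the first layer for an 8-bit w and a 12-bit d.

    Returns (p_low, r, q, q_9_8) where
      p_low[i] is p_i for i = 0..5,
      r[i]     is r_i for i = 6..13   (equation (2)),
      q[i]     is q_i for i = 0..13   (equation (3)),
      q_9_8    is the ancillary bit q9 AND q8."""
    h = d & 0x3F          # d5..d0
    g = d >> 6            # d11..d6

    wh = w * h            # 14-bit value: r13..r6 p5..p0
    wg = w * g            # 14-bit value: q13..q0

    p_low = [(wh >> i) & 1 for i in range(6)]
    r = {i: (wh >> i) & 1 for i in range(6, 14)}
    q = {i: (wg >> i) & 1 for i in range(14)}
    q_9_8 = q[9] & q[8]
    return p_low, r, q, q_9_8
```

Each output bit here depends on only six bits of d once w is fixed, which is why each can be realized by a single 6-input LUT. The full product is recovered as w·h + 2⁶·(w·g), and the later layers perform that addition.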

Further, referring to the flowchart in FIG. 4, method 400 includes determining the output bit-values of the second circuit 320, at block 420. Part of the output of the second circuit 320 includes a predetermined number of bits 610 of p, at block 422. As shown in FIG. 6, in the case with the 12-bit d and 8-bit w, the second circuit 320 computes p₆ to p₈ (610). The second circuit 320 further determines a second ancillary bit-value (x₉) 620 that is used as an input bit-value to other LUTs 350 from the multiplication circuit 115, at block 424. The bit-values 610 and the second ancillary bit-value 620 are both determined based on subsets of the first set of auxiliary bit-values 520 and the second set of auxiliary bit-values 530 (r₈, r₇, r₆, and q₂, q₁, q₀). The computation can be expressed as:

$\begin{matrix}\frac{\begin{matrix}\; \\ + \end{matrix}\begin{matrix}\; \\\;\end{matrix}\begin{matrix}r_{8} \\q_{2}\end{matrix}\begin{matrix}r_{7} \\q_{1}\end{matrix}\begin{matrix}r_{6} \\q_{0}\end{matrix}}{\begin{matrix} = \\\;\end{matrix}\begin{matrix}x_{9} \\\;\end{matrix}\begin{matrix}p_{8} \\\;\end{matrix}\begin{matrix}p_{7} \\\;\end{matrix}\begin{matrix}p_{6} \\\;\end{matrix}} & (4)\end{matrix}$

Further, the second circuit 320 determines a third set of auxiliary bit-values (y₉ to y₁₂) 630, at block 426. The third set of auxiliary bit-values is determined using subsets of the first set of auxiliary bit-values 520 and the second set of auxiliary bit-values 530 (r₁₁, r₁₀, r₉, and q₅, q₄, q₃). The computation can be expressed as:

$\begin{matrix}\frac{\begin{matrix}\; \\ + \end{matrix}\begin{matrix}\; \\\;\end{matrix}\begin{matrix}r_{11} \\q_{5}\end{matrix}\begin{matrix}r_{10} \\q_{4}\end{matrix}\begin{matrix}r_{9} \\q_{3}\end{matrix}}{\begin{matrix} = \\\;\end{matrix}\begin{matrix}y_{12} \\\;\end{matrix}\begin{matrix}y_{11} \\\;\end{matrix}\begin{matrix}y_{10} \\\;\end{matrix}\begin{matrix}y_{9} \\\;\end{matrix}} & (5)\end{matrix}$

The second circuit 320 further determines a third ancillary bit-value (y_(10 . . . 9)) 640 and a fourth ancillary bit-value (y_(11 . . . 9)) 650, at block 427. The ancillary bit-values are used by the third circuit 330. In one or more embodiments of the present invention, the third ancillary bit-value (y_(10 . . . 9)) 640 and the fourth ancillary bit-value (y_(11 . . . 9)) 650 are part of the third set of auxiliary values 630. The ancillary bit-values represent a combination of one or more bit-values from the third set of auxiliary bit-values, and the computation can be expressed as:

y_(10 . . . 9)=y₁₀∧y₉

y_(11 . . . 9)=y₁₁∧y₁₀∧y₉  (6)

Further, the second circuit 320, using (r₁₃, r₁₂, and q₁₀, q₉, q₈, q₇, q₆), determines a fourth set of auxiliary bit-values (z₁₆ to z₁₂) 660, at block 428. The computation can be expressed as:

$\begin{matrix}{\frac{\begin{matrix}\; \\ + \end{matrix}\begin{matrix}0 \\q_{10}\end{matrix}\begin{matrix}0 \\q_{9}\end{matrix}\begin{matrix}0 \\q_{8}\end{matrix}\begin{matrix}r_{13} \\q_{7}\end{matrix}\begin{matrix}r_{12} \\q_{6}\end{matrix}}{\begin{matrix}{= z_{16}} \\\;\end{matrix}\begin{matrix}z_{15} \\\;\end{matrix}\begin{matrix}z_{14} \\\;\end{matrix}\begin{matrix}z_{13} \\\;\end{matrix}\begin{matrix}z_{12} \\\;\end{matrix}}.} & (7)\end{matrix}$

Further yet, the second circuit 320 determines a fifth ancillary bit-value (z_(14 . . . 13)) 670 and a sixth ancillary bit-value (z_(15 . . . 13)) 680, at block 429. The computation of the ancillary bit-values can be expressed as:

z_(14 . . . 13)=z₁₄∧z₁₃  (8)

z_(15 . . . 13)=z₁₅∧z₁₄∧z₁₃.  (9)

Additionally, the bit-value z₁₆ 662 represents the combination:

z₁₆=q₁₀⊗((q₉∧q₈)∧((r₁₃∧q₇)∨(r₁₃∧r₁₂∧q₆)∨(q₇∧r₁₂∧q₆))).  (10)

Here, a⊗b=(a∧¬b)∨(¬a∧b). This implies that q₉ can be replaced by q_(9 . . . 8)=q₉∧q₈ for the computation of z₁₆. Accordingly, z₁₆ 662 can be obtained as a function of six input bit-values (r₁₃, r₁₂, and q₁₀, q_(9 . . . 8), q₇, q₆) to a LUT 350. FIG. 8 depicts the LUT 350 for obtaining z₁₆ 662.
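The second-layer additions in equations (4), (5), and (7), and the closed form for z₁₆ in equation (10), can be cross-checked with a behavioral model. The Python sketch below is illustrative only: the function name `second_circuit` and the dictionary keys are hypothetical, r is passed as an 8-bit integer holding r₁₃..r₆ (r₆ in bit 0), and q is passed as a 14-bit integer holding q₁₃..q₀.

```python
def second_circuit(r, q):
    """Behavioral model of the second layer.

    r: 8-bit integer with r6 in bit 0 and r13 in bit 7.
    q: 14-bit integer with q0 in bit 0 and q13 in bit 13."""
    def rb(i):                      # bit r_i, for i = 6..13
        return (r >> (i - 6)) & 1

    def qb(i):                      # bit q_i, for i = 0..13
        return (q >> i) & 1

    # Equation (4): (r8 r7 r6) + (q2 q1 q0) = (x9 p8 p7 p6).
    s = (r & 0b111) + (q & 0b111)
    p6, p7, p8, x9 = s & 1, (s >> 1) & 1, (s >> 2) & 1, (s >> 3) & 1

    # Equation (5): (r11 r10 r9) + (q5 q4 q3) = (y12 y11 y10 y9).
    y = ((r >> 3) & 0b111) + ((q >> 3) & 0b111)

    # Equation (7): (r13 r12) + (q10..q6) = (z16..z12), kept to five bits.
    z = (((r >> 6) & 0b11) + ((q >> 6) & 0b11111)) & 0b11111

    y9, y10, y11 = y & 1, (y >> 1) & 1, (y >> 2) & 1
    z13, z14, z15 = (z >> 1) & 1, (z >> 2) & 1, (z >> 3) & 1

    # Equation (10): z16 from six inputs, with q9 AND q8 as one input.
    q_9_8 = qb(9) & qb(8)
    carry = (rb(13) & qb(7)) | (rb(13) & rb(12) & qb(6)) | (qb(7) & rb(12) & qb(6))
    z16 = qb(10) ^ (q_9_8 & carry)

    return {
        "p6": p6, "p7": p7, "p8": p8, "x9": x9,
        "y": y, "z": z,
        "y_10_9": y10 & y9, "y_11_9": y11 & y10 & y9,      # equation (6)
        "z_14_13": z14 & z13, "z_15_13": z15 & z14 & z13,  # equations (8), (9)
        "z16": z16,
    }
```

In this model, `z16` is computed from the six-input expression alone, so comparing it with bit 4 of `z` checks that equation (10) agrees with the addition in equation (7).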

Referring back to FIG. 4, the method 400 further includes determining the remaining bit-values (MSBs) 710 of the product p using the third circuit 330, at block 430. Determining the MSBs 710 includes using several combinations of the input bit-values of d, the sets of auxiliary bit-values, and the several ancillary bit-values computed by the first circuit 310 and the second circuit 320.

The bit p₉ is determined using bits (y₉, x₉) based on:

p₉≡p₉(y₉, x₉)=(y₉+x₉) (mod 2)

The corresponding LUT 350 is depicted in view 910 of FIG. 9 to determine p₉ based on the input bit-values.

The bit p₁₀ is determined using bits (y₁₀, y₉, x₉) based on:

$p_{10} \equiv p_{10}(y_{10}, y_{9}, x_{9}) = \begin{cases} y_{10} & \text{if } (y_{9}, x_{9}) \neq (1, 1) \\ 1 - y_{10} & \text{if } (y_{9}, x_{9}) = (1, 1) \end{cases}$

The corresponding LUT 350 is depicted in view 920 of FIG. 9 to determine p₁₀ based on the input bit-values.

The bit p₁₁ is determined using bits (y₁₁, y₁₀, y₉, x₉) based on:

$p_{11} \equiv p_{11}(y_{11}, y_{10}, y_{9}, x_{9}) = \begin{cases} y_{11} & \text{if } (y_{10}, y_{9}, x_{9}) \neq (1, 1, 1) \\ 1 - y_{11} & \text{if } (y_{10}, y_{9}, x_{9}) = (1, 1, 1) \end{cases}$

The dependence of p₁₁ on y₁₀ and y₉ is only to check whether or not (y₁₀, y₉)=(1, 1). Hence, y₁₀ and y₉ can be replaced by the third ancillary bit 640 (y_(10 . . . 9)). The corresponding LUT 350 is depicted in view 930 of FIG. 9 to determine p₁₁ based on such input bit-values.

Calculating p₁₂ requires computing the addition:

$\begin{matrix} & 0 & y_{12} & 0 & 0 & x_{9} \\ + & 0 & z_{12} & y_{11} & y_{10} & y_{9} \\ = & c_{13} & p_{12} & p_{11} & p_{10} & p_{9} \end{matrix}$

where c₁₃ is not used, and p₉, p₁₀, and p₁₁ are determined as described earlier. Here, the dependence of p₁₂ on y₁₁, y₁₀, and y₉ is only to check whether or not (y₁₁, y₁₀, y₉)=(1, 1, 1). Hence, y₁₁, y₁₀, and y₉ can be replaced by the fourth ancillary bit 650 (y_(11 . . . 9)). The corresponding LUT 350 is depicted in view 940 of FIG. 9 to determine p₁₂ based on such input bit-values.

Calculating p₁₃ requires computing the addition:

$\begin{matrix} & 0 & 0 & y_{12} & 0 & 0 & x_{9} \\ + & 0 & z_{13} & z_{12} & y_{11} & y_{10} & y_{9} \\ = & c_{14} & p_{13} & p_{12} & p_{11} & p_{10} & p_{9} \end{matrix}$

where c₁₄ is not used, and p₉, p₁₀, p₁₁, and p₁₂ are determined as described earlier. Here, the dependence of p₁₃ on y₁₁, y₁₀, and y₉ is only to check whether or not (y₁₁, y₁₀, y₉)=(1, 1, 1). Hence, y₁₁, y₁₀, and y₉ can be replaced by a single bit, the fourth ancillary bit 650 (y_(11 . . . 9)). Accordingly, p₁₃ can be determined as a function of five variables (z₁₃, z₁₂, y₁₂, y_(11 . . . 9), x₉). The corresponding LUT 350 is depicted in view 950 of FIG. 9 to determine p₁₃ based on such input bit-values. In view 950, the calculation of c₁₄ is depicted for denotational purposes. However, c₁₄ is not used.

Calculating p₁₄ requires computing the addition:

$\begin{matrix} & 0 & 0 & 0 & y_{12} & 0 & 0 & x_{9} \\ + & 0 & z_{14} & z_{13} & z_{12} & y_{11} & y_{10} & y_{9} \\ = & c_{15} & p_{14} & p_{13} & p_{12} & p_{11} & p_{10} & p_{9} \end{matrix}$

where c₁₅ is not used, and p₉, p₁₀, p₁₁, p₁₂, and p₁₃ are determined as described earlier.

Here, the addition is a function of eight variables. As noted earlier, the technical solutions herein overcome the technical challenge of handling such cases with more than six input bit-values. In this particular case, y₁₁, y₁₀, and y₉ can be replaced by a single bit, the fourth ancillary bit 650 (y_(11 . . . 9)), so that p₁₄ can be determined using the LUT 350 shown in view 1010 of FIG. 10.

Calculating p₁₅ requires computing the addition:

$\begin{matrix} & 0 & 0 & 0 & 0 & y_{12} & 0 & 0 & x_{9} \\ + & 0 & z_{15} & z_{14} & z_{13} & z_{12} & y_{11} & y_{10} & y_{9} \\ = & c_{16} & p_{15} & p_{14} & p_{13} & p_{12} & p_{11} & p_{10} & p_{9} \end{matrix}$

where c₁₆ is not used, and p₉, p₁₀, p₁₁, p₁₂, p₁₃, and p₁₄ are determined as described earlier.

Here, the addition is a function of nine variables. Again, y₁₁, y₁₀, and y₉ can be replaced by a single bit, the fourth ancillary bit 650 (y_(11 . . . 9)). Furthermore, the dependence of p₁₅ on z₁₄ and z₁₃ is only to check whether or not (z₁₄, z₁₃)=(1, 1). Hence, z₁₄ and z₁₃ can be replaced by the fifth ancillary bit-value 670 (z_(14 . . . 13)), so that p₁₅ can be determined using the LUT 350 shown in view 1110 of FIG. 11 as a function of six variables.
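The reduction from nine variables to six can be illustrated with a short exhaustive check. The Python sketch below is an assumption-laden illustration rather than the described LUT contents: the names `p15_direct` and `p15_six_inputs` are hypothetical, and bit 9 is treated as the least significant bit of the addition.

```python
from itertools import product


def p15_direct(z15, z14, z13, z12, y12, y11, y10, y9, x9):
    """Bit p15 of the nine-variable addition, with bit 9 as the LSB."""
    top = (y12 << 3) | x9
    bottom = ((z15 << 6) | (z14 << 5) | (z13 << 4) | (z12 << 3)
              | (y11 << 2) | (y10 << 1) | y9)
    return ((top + bottom) >> 6) & 1


def p15_six_inputs(z15, z14_13, z12, y12, y11_9, x9):
    """Same bit from six inputs, using the ancillary bits 650 and 670."""
    carry_into_12 = y11_9 & x9
    carry_into_13 = (y12 & z12) | (y12 & carry_into_12) | (z12 & carry_into_12)
    return z15 ^ (z14_13 & carry_into_13)


p15_mismatches = 0
for z15, z14, z13, z12, y12, y11, y10, y9, x9 in product((0, 1), repeat=9):
    reduced = p15_six_inputs(z15, z14 & z13, z12, y12, y11 & y10 & y9, x9)
    if reduced != p15_direct(z15, z14, z13, z12, y12, y11, y10, y9, x9):
        p15_mismatches += 1
print("mismatches:", p15_mismatches)
```

The same pattern, with fewer z inputs, applies to p₁₂, p₁₃, and p₁₄.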

Calculating p₁₆ requires computing the addition:

$\begin{matrix} & 0 & 0 & 0 & 0 & r_{13} & r_{12} & 0 & 0 & 0 \\ + & 0 & q_{10} & q_{9} & q_{8} & q_{7} & q_{6} & 0 & 0 & x_{9} \\ & 0 & 0 & 0 & 0 & 0 & y_{12} & y_{11} & y_{10} & y_{9} \\ = & c_{17} & p_{16} & p_{15} & p_{14} & p_{13} & p_{12} & p_{11} & p_{10} & p_{9} \end{matrix}$

where (q₁₀, . . . , q₆) and (r₁₃, r₁₂) are computed in the first circuit 310, and (y₁₂, y₁₁, y₁₀, y₉) and x₉ are computed in the second circuit 320.

As mentioned earlier, the third circuit 330 uses several ancillary bit-values and auxiliary bit-values that are determined by the first circuit 310 and the second circuit 320. For example, the ancillary bit-value (q_(10 . . . 8)) 550 is computed at the first circuit 310 to represent:

q_(10 . . . 8) = q₁₀ ∧ q₉ ∧ q₈ = q₁₀ · q₉ · q₈.

Further, in the second circuit 320, the result of the following addition can be determined using the LUTs 350:

$\quad\begin{matrix}\; & 0 & 0 & 0 & 0 & r_{13} & r_{12} \\ + & 0 & q_{10} & q_{9} & q_{8} & q_{7} & q_{6} \\ = & t_{17} & z_{16} & z_{15} & z_{14} & z_{13} & z_{12}\end{matrix}$

The result of the addition can be computed as a Boolean function of at most six inputs that are computed in the first circuit 310 as follows:

z₁₂ = z₁₂(r₁₂, q₆)
z₁₃ = z₁₃(r₁₃, r₁₂, q₇, q₆)
z₁₄ = z₁₄(r₁₃, r₁₂, q₈, q₇, q₆)
z₁₅ = z₁₅(r₁₃, r₁₂, q₉, q₈, q₇, q₆)
z₁₆ = z₁₆(r₁₃, r₁₂, q₁₀, q_(9 . . . 8), q₇, q₆),

and the bit

z_(15 . . . 13) = z_(15 . . . 13)(r₁₃, r₁₂, q₉, q₈, q₇, q₆) = z₁₅ · z₁₄ · z₁₃

The z_(15 . . . 13) bit-value is the sixth ancillary bit-value 680. The bit p₁₆ can be determined in the third circuit 330 as a Boolean function of six inputs that are determined by the LUTs 350 in the first circuit 310 and/or the second circuit 320:

p₁₆ = p₁₆(z₁₆, z_(15 . . . 13), z₁₂, y₁₂, y_(11 . . . 9), x₉)

The view 1210 in FIG. 12 depicts the LUT 350 for determining p₁₆ using the above six bit-value inputs.
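One way to see why six inputs suffice for p₁₆ is to compare the three-row addition with a layered evaluation. The following Python sketch is illustrative only; the names `p16_direct`, `p16_layered`, and `majority` are hypothetical, and the integers `q` and `y` are assumed to pack (q₁₀, . . . , q₆) and (y₁₂, . . . , y₉), respectively, with the lowest index in the least significant bit.

```python
from itertools import product


def majority(a, b, c):
    """Carry-out of a one-bit full adder."""
    return (a & b) | (a & c) | (b & c)


def p16_direct(r13, r12, q, y, x9):
    """Bit p16 of the three-row addition, with bit 9 as the LSB."""
    total = (((r13 << 1) | r12) << 3) + ((q << 3) + x9) + y
    return (total >> 7) & 1


def p16_layered(r13, r12, q, y, x9):
    """Second-circuit values first, then a six-input third-circuit function."""
    z = ((r13 << 1) | r12) + q                     # z12 is bit 0, z16 is bit 4
    z12, z16 = z & 1, (z >> 4) & 1
    z15_13 = int(((z >> 1) & 0b111) == 0b111)      # sixth ancillary bit-value 680
    y12 = (y >> 3) & 1
    y11_9 = int((y & 0b111) == 0b111)              # fourth ancillary bit-value 650
    return z16 ^ (z15_13 & majority(z12, y12, y11_9 & x9))


p16_mismatches = sum(
    p16_direct(r13, r12, q, y, x9) != p16_layered(r13, r12, q, y, x9)
    for r13, r12, x9 in product((0, 1), repeat=3)
    for q in range(32)
    for y in range(16)
)
print("mismatches:", p16_mismatches)
```

The final expression in `p16_layered` reads only z₁₆, z_(15 . . . 13), z₁₂, y₁₂, y_(11 . . . 9), and x₉, matching the six inputs listed above.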

Determining the final three MSBs using LUTs 350 is based on the following description of correctness. If N, d, and w are integers with w>0, then:

(i) w×d≥N if and only if d≥⌈N/w⌉.

(ii) w×d<N if and only if

$d \leq {\left\lfloor \frac{N - 1}{w} \right\rfloor.}$

Part (i) above holds true because if

$w \times d \geq N$, then $d \geq \frac{N}{w}$, $\therefore\ d \geq \left\lceil \frac{N}{w} \right\rceil$, since d is an integer. Conversely, if

$d \geq \left\lceil \frac{N}{w} \right\rceil$, then $w \times d \geq w \times \left\lceil \frac{N}{w} \right\rceil \geq w \times \frac{N}{w} = N.$

In the case (ii) above, if w×d<N, then w×d≤N−1,

$\therefore\ d \leq \frac{N - 1}{w}$, and

$d \leq \left\lfloor \frac{N - 1}{w} \right\rfloor.$ Conversely, if

$d \leq \left\lfloor \frac{N - 1}{w} \right\rfloor$, then $w \times d \leq w \times \left\lfloor \frac{N - 1}{w} \right\rfloor \leq w \times \frac{N - 1}{w} = N - 1.$
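The two equivalences can also be checked numerically over a small range. The Python sketch below is a sanity check under the assumption that w is a positive integer and d is non-negative; the helper names `ceil_div` and `lemma_holds` are hypothetical, and the sampled ranges are arbitrary.

```python
def ceil_div(n, w):
    """Ceiling of n / w for integers with w > 0."""
    return -(-n // w)


def lemma_holds(n, w, d):
    """True when both parts (i) and (ii) hold for the given integers."""
    part_i = (w * d >= n) == (d >= ceil_div(n, w))
    part_ii = (w * d < n) == (d <= (n - 1) // w)
    return part_i and part_ii


lemma_ok = all(
    lemma_holds(n, w, d)
    for n in range(0, 200)
    for w in range(1, 20)
    for d in range(0, 60)
)
print("lemma holds on the sampled range:", lemma_ok)
```

A check of this kind does not replace the argument above; it only guards against off-by-one slips when the thresholds are later precomputed from w.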

Now, a description is provided for determining the MSB p₁₉ using two layers of LUTs 350. Consider t=⌈2¹⁹/w⌉. Then P₁₉(d)=1⇔d≥t. This holds true because P₁₉(d)=1 if and only if w×d≥2¹⁹. It should be noted that here, 19 is used because the product of a 12-bit d and an 8-bit w is less than 2²⁰, so that p₁₉ is the most significant bit of the product. However, in the cases where w or d have a different number of bits, the exponent in the above condition is different. Based on the description herein, a person skilled in the art can determine that for every d∈B¹², the function P₁₉(d) can be evaluated using two layers of LUTs 350.

Accordingly, referring back to the flowchart in FIG. 4, the third circuit 330 determines whether d≥t, where t can be precomputed as a threshold based on the known w, at block 440. If the condition is satisfied, p₁₉ is set to 1, at block 442; else, p₁₉ is set to 0, at block 444.

FIG. 13 depicts lookup tables for determining p₁₉ according to one or more embodiments of the present invention. The LUTs 350 include a first LUT 1310, a second LUT 1320, and a third LUT 1330 that compare d with the precomputed threshold t. For this purpose, t is represented as t=⌈2¹⁹/w⌉=u·2⁶+v, where v<2⁶. Further, d is represented as d=g·2⁶+h (g<2⁶, and h<2⁶). Accordingly, setting p₁₉ to 1 if and only if d≥t implies setting p₁₉ to 1 if and only if ((u<g) OR (g=u AND v≤h)).

Using only 6-to-1 Boolean functions, the first LUT 1310 determines if u<g, the second LUT 1320 determines if g=u, and the third LUT 1330 determines if v≤h. The output of each of the LUTs is 1 if the respective condition holds true, and 0 otherwise. Further, a fourth LUT 1340 receives the outputs from the first LUT 1310, the second LUT 1320, and the third LUT 1330. Depending on the received bit-values, the fourth LUT 1340 determines the value of p₁₉.
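The two-layer evaluation of p₁₉ can be modeled in software as follows. This Python sketch is illustrative only: the function name `p19_two_layer` and the variable names are hypothetical, each first-layer comparison stands in for the contents of one LUT, and the sample values of w are chosen arbitrarily.

```python
def p19_two_layer(w, d):
    """Model of the FIG. 13 arrangement for a known w and a 12-bit input d."""
    t = -(-(1 << 19) // w)        # precomputed threshold, ceil(2**19 / w)
    u, v = divmod(t, 64)          # t = u * 2**6 + v
    g, h = divmod(d, 64)          # d = g * 2**6 + h

    # First layer: each comparison depends on six input bits only.
    lut_1310 = int(u < g)
    lut_1320 = int(g == u)
    lut_1330 = int(v <= h)

    # Second layer: the role of the fourth LUT 1340.
    return lut_1310 | (lut_1320 & lut_1330)


p19_ok = all(
    p19_two_layer(w, d) == (((w * d) >> 19) & 1)
    for w in (1, 3, 128, 129, 200, 255)
    for d in range(4096)
)
print("p19 matches bit 19 of the product:", p19_ok)
```

For small w the threshold t exceeds every 12-bit d, so u is larger than any g and the model returns 0 for all inputs, as expected.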

Now, a description is provided for determining the MSB p₁₈ using two layers of LUTs 350. It should be noted that p₁₈=1 if and only if one of the following conditions holds:

2¹⁸ ≤ w×d < 2¹⁹  (i)
2¹⁹+2¹⁸ ≤ w×d  (ii)

The following three thresholds can be precomputed based on the known w:

$t_{01} = \left\lceil \frac{2^{18}}{w} \right\rceil$
$t_{10} = \left\lfloor \frac{2^{19} - 1}{w} \right\rfloor$
$t_{11} = \left\lceil \frac{2^{19} + 2^{18}}{w} \right\rceil$

Accordingly, the method 400 includes setting p₁₈ to 1, at block 452, and else to 0, at block 454, based on the third circuit 330 determining, at block 450, whether one of the following conditions holds:

t₀₁ ≤ d ≤ t₁₀  (i)
t₁₁ ≤ d.  (ii)

The above is further equivalent to setting p₁₈ to 1 if and only if one of the following conditions holds:

(t₀₁ < d < t₁₀) or (t₁₁ < d)  (i)
d∈{t₀₁, t₁₀, t₁₁}.  (ii)

Again, consider d represented as d=g·2⁶+h. Let us denote B²={01, 10, 11}, and then, for every β∈B², t_(β)=u_(β)·2⁶+v_(β), where 0≤v_(β)<2⁶. Accordingly, determining the value for p₁₈ can be stated as p₁₈=1 if and only if one of the following conditions holds:

(i) 1. g>u₀₁ or (g=u₀₁ and h≥v₀₁) (i.e., d≥t₀₁), and
    2. g<u₁₀ or (g=u₁₀ and h≤v₁₀) (i.e., d≤t₁₀)
(ii) g>u₁₁ or (g=u₁₁ and h≥v₁₁) (i.e., d≥t₁₁)

This can be simplified as p₁₈=1 if and only if one of the following four conditions holds:

00: (u₀₁<g<u₁₀) or (u₁₁<g)
01: g=u₀₁ and h≥v₀₁
10: g=u₁₀ and h≤v₁₀
11: g=u₁₁ and h≥v₁₁

Consider a notation in which any inequality of variables (x<y, z≤w, etc.) is assigned a truth value ϕ(x<y)∈{0,1}, where 1 is “true,” and 0 is “false,” and logical connectives can be applied in the form, for example, ϕ(x=y)=ϕ(x≤y)∧ϕ(x≥y).

Accordingly, the above conditions can be succinctly stated as:

P₁₈(d) = [ϕ(t₀₁≤d) ∧ ϕ(d≤t₁₀)] ∨ ϕ(t₁₁≤d).

The values for t₀₁, t₁₀, and t₁₁ can be represented as:

t₀₁ = u₀₁·2⁶ + v₀₁ where 0≤v₀₁<2⁶,
t₁₀ = u₁₀·2⁶ + v₁₀ where 0≤v₁₀<2⁶,
t₁₁ = u₁₁·2⁶ + v₁₁ where 0≤v₁₁<2⁶.

Now, for every β∈{01, 10, 11}, u_(β)=⌊t_(β)/2⁶⌋; hence, u₀₁≤u₁₀≤u₁₁. Each possible number g is related to u₀₁, u₁₀, and u₁₁ in one of seven possible ways, which can be labeled with three bits as follows:

(001): g<u₀₁
(010): g=u₀₁
(011): u₀₁<g<u₁₀
(100): g=u₁₀
(101): u₁₀<g<u₁₁
(110): g=u₁₁
(111): u₁₁<g

It follows that given g, the particular case label (x₁, x₂, x₃) can be returned by three 6-to-1 Boolean functions, x₁(g), x₂(g), x₃(g). Given (x₁, x₂, x₃), the information required for the evaluation of P₁₈(d) can be expressed as:

(u₀₁<g) ≡ x₁ ∨ (¬x₁ ∧ x₂ ∧ x₃)
(u₀₁=g) ≡ ¬x₁ ∧ x₂ ∧ ¬x₃
(g<u₁₀) ≡ ¬x₁
(g=u₁₀) ≡ x₁ ∧ ¬x₂ ∧ ¬x₃
(u₁₁<g) ≡ x₁ ∧ x₂ ∧ x₃
(u₁₁=g) ≡ x₁ ∧ x₂ ∧ ¬x₃

The situation with respect to the relation of h to the v_(β) is simpler. The only information required for the evaluation of P₁₈(d) is captured by the following 6-to-1 Boolean functions:

$y_{1}(h) = \begin{cases} 1 & \text{if } h \geq v_{01} \\ 0 & \text{otherwise} \end{cases}$
$y_{2}(h) = \begin{cases} 1 & \text{if } h \leq v_{10} \\ 0 & \text{otherwise} \end{cases}$
$y_{3}(h) = \begin{cases} 1 & \text{if } h \geq v_{11} \\ 0 & \text{otherwise} \end{cases}$

At this stage, it can be shown that for every d∈B¹² the function P₁₈(d) can be evaluated in two layers. The evaluation of P₁₈(d) relies on the inequality relations of d to the t_(β)s. Accordingly:

d≥t₀₁ ⇔ (g>u₀₁) ∨ [(g=u₀₁) ∧ (h≥v₀₁)]
d≤t₁₀ ⇔ (g<u₁₀) ∨ [(g=u₁₀) ∧ (h≤v₁₀)]
d≥t₁₁ ⇔ (g>u₁₁) ∨ [(g=u₁₁) ∧ (h≥v₁₁)]

Further:

ϕ(d≥t₀₁) = ϕ(g>u₀₁) ∨ (ϕ(g=u₀₁) ∧ y₁(h))
ϕ(d≤t₁₀) = ϕ(g<u₁₀) ∨ (ϕ(g=u₁₀) ∧ y₂(h))
ϕ(d≥t₁₁) = ϕ(g>u₁₁) ∨ (ϕ(g=u₁₁) ∧ y₃(h))

Thus, it follows that the relations of d to the t_(β)s can be evaluated by a 6-to-1 Boolean function applied to the 6-tuple (x₁(g), x₂(g), x₃(g), y₁(h), y₂(h), y₃(h)).

FIG. 14 depicts lookup tables for determining p₁₈ according to one or more embodiments of the present invention. The LUTs 350 for determining p₁₈ are in a two-layer setup. The LUTs 350 include a first LUT 1410, a second LUT 1420, and a third LUT 1430 that determine x₁(g), x₂(g), x₃(g), respectively. Further, the LUTs 350 include a fourth LUT 1440, a fifth LUT 1450, and a sixth LUT 1460 that determine y₁(h), y₂(h), y₃(h), respectively. The bit-values of x₁(g), x₂(g), x₃(g), y₁(h), y₂(h), y₃(h) are used to determine p₁₈.
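A software model of the p₁₈ evaluation can make the role of the three thresholds concrete. The Python sketch below is a simplification and not the described LUT contents: the names `p18_two_layer` and `ceil_div` are hypothetical, and the comparisons of g with u₀₁, u₁₀, and u₁₁ are written out directly instead of being packed into the three-bit label (x₁, x₂, x₃).

```python
def ceil_div(n, w):
    """Ceiling of n / w for integers with w > 0."""
    return -(-n // w)


def p18_two_layer(w, d):
    """Bit 18 of w * d from comparisons on g and h, for an 8-bit known w."""
    t01 = ceil_div(1 << 18, w)
    t10 = ((1 << 19) - 1) // w
    t11 = ceil_div((1 << 19) + (1 << 18), w)
    (u01, v01), (u10, v10), (u11, v11) = (divmod(t, 64) for t in (t01, t10, t11))
    g, h = divmod(d, 64)

    # First layer, h side: the functions y1(h), y2(h), y3(h).
    y1, y2, y3 = h >= v01, h <= v10, h >= v11

    # Second layer: combine the g relations (carried by x1, x2, x3 in the
    # description) with y1, y2, y3.
    ge_t01 = g > u01 or (g == u01 and y1)
    le_t10 = g < u10 or (g == u10 and y2)
    ge_t11 = g > u11 or (g == u11 and y3)
    return int((ge_t01 and le_t10) or ge_t11)


p18_ok = all(
    p18_two_layer(w, d) == (((w * d) >> 18) & 1)
    for w in (1, 64, 65, 127, 200, 255)
    for d in range(4096)
)
print("p18 matches bit 18 of the product:", p18_ok)
```

In hardware, the second layer would read only the six bits produced by the first layer, which is why a single six-input LUT can complete the evaluation.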

It can be shown that if 1≤w≤256, then u₀₁<u₁₀. That is because even if w=256,

$u_{01} = \left\lfloor \frac{\left\lceil 2^{18}/256 \right\rceil}{2^{6}} \right\rfloor = \left\lfloor \frac{2^{10}}{2^{6}} \right\rfloor = 2^{4}$; and

$u_{10} = \left\lfloor \frac{\left\lfloor (2^{19} - 1)/256 \right\rfloor}{2^{6}} \right\rfloor = \left\lfloor \frac{2^{11} - 1}{2^{6}} \right\rfloor = 2^{5} - 1$

Now, a description is provided for determining the third MSB p₁₇ using three layers of LUTs. The bit p₁₇=1 if and only if one of the following conditions holds:

2¹⁷ ≤ w×d < 2¹⁸  (i)
2¹⁸+2¹⁷ ≤ w×d < 2¹⁹  (ii)
2¹⁹+2¹⁷ ≤ w×d < 2¹⁹+2¹⁸  (iii)
2¹⁹+2¹⁸+2¹⁷ ≤ w×d  (iv)

The conditions can be restated using seven thresholds:

$t_{001} = \left\lceil \frac{2^{17}}{w} \right\rceil$
$t_{010} = \left\lfloor \frac{2^{18} - 1}{w} \right\rfloor$
$t_{011} = \left\lceil \frac{2^{18} + 2^{17}}{w} \right\rceil$
$t_{100} = \left\lfloor \frac{2^{19} - 1}{w} \right\rfloor$
$t_{101} = \left\lceil \frac{2^{19} + 2^{17}}{w} \right\rceil$
$t_{110} = \left\lfloor \frac{2^{19} + 2^{18} - 1}{w} \right\rfloor$
$t_{111} = \left\lceil \frac{2^{19} + 2^{18} + 2^{17}}{w} \right\rceil$

The conditions can be restated using these seven thresholds as:

t₀₀₁ ≤ d ≤ t₀₁₀  (i)
t₀₁₁ ≤ d ≤ t₁₀₀  (ii)
t₁₀₁ ≤ d ≤ t₁₁₀  (iii)
t₁₁₁ ≤ d.  (iv)

Referring to the flowchart in FIG. 4, method 400 includes setting p₁₇=1, at block 462, if the above conditions using the thresholds are met, at block 460; otherwise, p₁₇=0, at block 464. The conditions can be further restated as p₁₇=1 if and only if one of the following conditions holds:

(t₀₀₁<d<t₀₁₀) or (t₀₁₁<d<t₁₀₀) or (t₁₀₁<d<t₁₁₀) or (t₁₁₁<d)  (i)
d∈{t₀₀₁, t₀₁₀, t₀₁₁, t₁₀₀, t₁₀₁, t₁₁₀, t₁₁₁}.  (ii)
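The restatement of the p₁₇ conditions in terms of seven thresholds can be checked with a few lines of code. The Python sketch below is illustrative; the names `p17_from_thresholds` and `ceil_div` are hypothetical, and the sample values of w are arbitrary.

```python
def ceil_div(n, w):
    """Ceiling of n / w for integers with w > 0."""
    return -(-n // w)


def p17_from_thresholds(w, d):
    """Bit 17 of w * d, decided only by comparing d with seven thresholds."""
    a, b, c = 1 << 17, 1 << 18, 1 << 19
    t001, t010 = ceil_div(a, w), (b - 1) // w
    t011, t100 = ceil_div(b + a, w), (c - 1) // w
    t101, t110 = ceil_div(c + a, w), (c + b - 1) // w
    t111 = ceil_div(c + b + a, w)
    return int(t001 <= d <= t010 or t011 <= d <= t100
               or t101 <= d <= t110 or t111 <= d)


p17_thresholds_ok = all(
    p17_from_thresholds(w, d) == (((w * d) >> 17) & 1)
    for w in (1, 33, 100, 128, 200, 255)
    for d in range(4096)
)
print("threshold form matches bit 17 of the product:", p17_thresholds_ok)
```

Because the thresholds depend only on the known w, they can be computed once when the lookup tables are configured.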

Again, consider d represented as d=g·2⁶+h. Let us denote B³={001, 010, 011, 100, 101, 110, 111}, and then, for every β∈B³, t_(β)=u_(β)·2⁶+v_(β), where 0≤v_(β)<2⁶. Accordingly, determining the value for p₁₇ can be stated as p₁₇=1 if and only if one of the following conditions holds:

(i) 1. g>u₀₀₁ or (g=u₀₀₁ and h≥v₀₀₁) (i.e., d≥t₀₀₁), and
    2. g<u₀₁₀ or (g=u₀₁₀ and h≤v₀₁₀) (i.e., d≤t₀₁₀)
(ii) 1. g>u₀₁₁ or (g=u₀₁₁ and h≥v₀₁₁) (i.e., d≥t₀₁₁), and
     2. g<u₁₀₀ or (g=u₁₀₀ and h≤v₁₀₀) (i.e., d≤t₁₀₀)
(iii) 1. g>u₁₀₁ or (g=u₁₀₁ and h≥v₁₀₁) (i.e., d≥t₁₀₁), and
      2. g<u₁₁₀ or (g=u₁₁₀ and h≤v₁₁₀) (i.e., d≤t₁₁₀)
(iv) g>u₁₁₁ or (g=u₁₁₁ and h≥v₁₁₁) (i.e., d≥t₁₁₁).

This can be simplified as p₁₇=1 if and only if one of the following eight conditions holds:

000: (u₀₀₁<g<u₀₁₀) or (u₀₁₁<g<u₁₀₀) or (u₁₀₁<g<u₁₁₀) or (u₁₁₁<g)
001: g=u₀₀₁ and h≥v₀₀₁
010: g=u₀₁₀ and h≤v₀₁₀
011: g=u₀₁₁ and h≥v₀₁₁
100: g=u₁₀₀ and h≤v₁₀₀
101: g=u₁₀₁ and h≥v₁₀₁
110: g=u₁₁₀ and h≤v₁₁₀
111: g=u₁₁₁ and h≥v₁₁₁

Any Boolean function of (x₁, . . . , x₄) is also a Boolean function of g, so it can be evaluated in the first circuit 310. Accordingly, consider compressing the representation to three bits based on the following:

If g satisfies one of the following, then P₁₇(d)=1:

u₀₀₁<g<u₀₁₀  (i)
u₀₁₁<g<u₁₀₀  (ii)
u₁₀₁<g<u₁₁₀  (iii)
u₁₁₁<g  (iv)

If g satisfies one of the following, then P₁₇(d)=0:

g<u₀₀₁  (i)
u₀₁₀<g<u₀₁₁  (ii)
u₁₀₀<g<u₁₀₁  (iii)
u₁₁₀<g<u₁₁₁  (iv)

However, this still needs encoding of seven possible equalities, namely g=u_(β), β∈B³, and so, three bits cannot be used to encode nine cases. However, for every β∈B³, the case g=u_(β) can be encoded by 0β; for example, g=u₀₁₁ is encoded by 0011. Accordingly, the case for P₁₇(d)=1 can be encoded by:

(u₀₀₁<g<u₀₁₀) or (u₀₁₁<g<u₁₀₀) or (u₁₀₁<g<u₁₁₀) or (u₁₁₁<g)

Additionally, the case for P₁₇(d)=0 can be encoded by:

(g<u₀₀₁) or (u₀₁₀<g<u₀₁₁) or (u₁₀₀<g<u₁₀₁) or (u₁₁₀<g<u₁₁₁)

This four-bit encoding is denoted as (z₁, z₂, z₃, z₄). The encoding is depicted by view 1510 in FIG. 15. Thus, the four Boolean functions are:

z₁ = ϕ(u₀₀₁<g<u₀₁₀) ∨ ϕ(u₀₁₁<g<u₁₀₀) ∨ ϕ(u₁₀₁<g<u₁₁₀) ∨ ϕ(u₁₁₁<g)
z₂ = ϕ(g=u₁₀₀) ∨ ϕ(g=u₁₀₁) ∨ ϕ(g=u₁₁₀) ∨ ϕ(g=u₁₁₁)
z₃ = ϕ(g=u₀₁₀) ∨ ϕ(g=u₀₁₁) ∨ ϕ(g=u₁₁₀) ∨ ϕ(g=u₁₁₁)
z₄ = ϕ(g=u₀₀₁) ∨ ϕ(g=u₀₁₁) ∨ ϕ(g=u₁₀₁) ∨ ϕ(g=u₁₁₁)

The situation with respect to the relation of h to the v_(β)s is simpler than that of g, but more complicated than in the case of p₁₈. Here, the following truth-values are required:

ϕ(h≥v₀₀₁), ϕ(h≤v₀₁₀), ϕ(h≥v₀₁₁), ϕ(h≤v₁₀₀), ϕ(h≥v₁₀₁), ϕ(h≤v₁₁₀), ϕ(h≥v₁₁₁).

Let v₁<v₂< . . . <v_(l) (l≤7) be the distinct elements of {v_(β): β∈B³} and let ψ(β)∈{1, . . . , l} be the index such that v_(ψ(β))=v_(β). Thus, the above seven values can be expressed as:

ϕ(h≥v_(ψ(001))), ϕ(h≤v_(ψ(010))), ϕ(h≥v_(ψ(011))), ϕ(h≤v_(ψ(100))), ϕ(h≥v_(ψ(101))), ϕ(h≤v_(ψ(110))), ϕ(h≥v_(ψ(111))).

The relations of h to all of these seven values can be captured by three bits (b₁, b₂, b₃) as follows. First note that each occurrence of v_(β) is involved in precisely one inequality, namely, depending on β, either ϕ(h≥v_(β)) has to be known or ϕ(h≤v_(β)) has to be known. The former occurs when β∈{001, 011, 101, 111}, and the latter when β∈{010, 100, 110}. Therefore, there are precisely seven cases that are needed to characterize the location of h with respect to v₁<v₂< . . . <v_(l) so that the information required is retrieved. The cases can be viewed as a partition of the set {0, 1, . . . , 63} into at most eight intervals, some of which may consist of a single point. These seven cases are defined by inserting the seven inequality signs that occur in the above seven conditions in the appropriate places. The inequality signs also take into account v₁, . . . , v_(l) as follows. Let 1≤i≤l, and consider a certain value v_(i). If i=ψ(β), then (i) if β∈{001, 011, 101, 111}, then include an inequality ≤v_(i), and (ii) if β∈{010, 100, 110}, then include an inequality v_(i)≤. If there exist β₁∈{001, 011, 101, 111} and β₂∈{010, 100, 110} such that i=ψ(β₁)=ψ(β₂), then two inequalities are included: ≤v_(i), and v_(i)≤. If it has to be known whether or not h≤v_(β), then in the partition v_(β) must be the right endpoint of one of the intervals, and if it has to be known whether or not h≥v_(β), then in the partition v_(β) must be the left endpoint of one of the intervals. This way, if it is known which of the intervals contains h, then the information that is required about its relation to any v_(β) is known. For example, suppose:

v₀₀₁=3, v₀₁₀=4, v₀₁₁=7, v₁₀₀=8, v₁₀₁=11, v₁₁₀=12, v₁₁₁=15

Here, it is to be determined whether or not h≥3, whether or not h≤4, whether or not h≥7, whether or not h≤8, whether or not h≥11, whether or not h≤12, and whether or not h≥15. Therefore, the partition into eight intervals is the following:

000: 0≤h≤2
001: 3≤h≤4
010: 5≤h≤6
011: 7≤h≤8
100: 9≤h≤10
101: 11≤h≤12
110: 13≤h≤14
111: 15≤h≤63

The labels on the left provide the encoding with three bits, so the corresponding three Boolean functions are represented in view 1520 in FIG. 15. The Boolean functions b₁, b₂, and b₃ are the following:

b₁(h) = ϕ(9≤h≤63)
b₂(h) = ϕ(5≤h≤8) ∨ ϕ(13≤h≤63)
b₃(h) = ϕ(3≤h≤4) ∨ ϕ(7≤h≤8) ∨ ϕ(11≤h≤12) ∨ ϕ(15≤h≤63)
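The interval encoding of the example can be reproduced programmatically. The Python sketch below is one possible reading of the construction, under the assumption that each bound of the form h≥v contributes v as a left endpoint and each bound of the form h≤v contributes v+1 as the next left endpoint; the names `interval_starts` and `encode_h` are hypothetical.

```python
from bisect import bisect_right

GE_BOUNDS = [3, 7, 11, 15]   # example values tested as h >= v (beta = 001, 011, 101, 111)
LE_BOUNDS = [4, 8, 12]       # example values tested as h <= v (beta = 010, 100, 110)


def interval_starts(ge_bounds, le_bounds):
    """Left endpoints of the partition of {0, ..., 63}."""
    starts = {0} | set(ge_bounds) | {v + 1 for v in le_bounds}
    return sorted(s for s in starts if s <= 63)


def encode_h(h, starts):
    """Three-bit label (b1, b2, b3) of the interval that contains h."""
    label = bisect_right(starts, h) - 1
    return (label >> 2) & 1, (label >> 1) & 1, label & 1


example_starts = interval_starts(GE_BOUNDS, LE_BOUNDS)
for h in (0, 4, 8, 12, 15, 63):
    print(h, encode_h(h, example_starts))
```

For the example values, the left endpoints come out as 0, 3, 5, 7, 9, 11, 13, and 15, which are the eight intervals listed above.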

Thus, it follows that the relations of d to the t_(β)s can be evaluated by a single 7-to-1 Boolean function ƒ(z₁, . . . , z₄, b₁, b₂, b₃). This evaluation can be carried out as follows. First, a Boolean function ƒ(z₂, z₃, z₄, b₁, b₂, b₃) is defined and implemented in the second circuit 320 by setting its value to 1 when (z₂, z₃, z₄)≠(0, 0, 0), and every x in the interval indicated by (b₁, b₂, b₃) satisfies the bound on h that is required given that g has the value indicated by (z₂, z₃, z₄); otherwise, ƒ(z₂, z₃, z₄, b₁, b₂, b₃)=0. In the third circuit 330, the value of P₁₇ is set to 1 if z₁=1 or ƒ(z₂, z₃, z₄, b₁, b₂, b₃)=1; otherwise, P₁₇ is set to 0.

FIG. 16 depicts a three-layer LUT circuit for determining p₁₇ according to one or more embodiments of the present invention. The LUTs 350 include a set of four LUTs (a first LUT 1610, a second LUT 1620, a third LUT 1630, and a fourth LUT 1640) for determining results of z₁, z₂, z₃, and z₄, respectively, by comparing portions of g and u as described above. Further, the LUTs 350 include another set of three LUTs (a fifth LUT 1650, a sixth LUT 1660, and a seventh LUT 1670) for determining b₁, b₂, b₃, respectively, by encoding h as described above. An eighth LUT 1680 uses z₂, z₃, z₄, and b₁, b₂, b₃ to determine an output a₁. A ninth LUT 1690 uses z₁ and a₁ to determine p₁₇.
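The three-layer arrangement can be modeled end to end in software. The Python sketch below is a simplified model under stated assumptions, not the actual LUT contents: the names `thresholds` and `p17_three_layer` are hypothetical, the code (z₂, z₃, z₄) and the label (b₁, b₂, b₃) are held as small integers instead of separate bits, and w is assumed to be an 8-bit value so that the u_(β) are pairwise distinct.

```python
from bisect import bisect_right

ODD = (0b001, 0b011, 0b101, 0b111)    # thresholds tested with h >= v
EVEN = (0b010, 0b100, 0b110)          # thresholds tested with h <= v


def ceil_div(n, w):
    return -(-n // w)


def thresholds(w):
    """The seven thresholds t_beta, keyed by beta."""
    a, b, c = 1 << 17, 1 << 18, 1 << 19
    return {0b001: ceil_div(a, w), 0b010: (b - 1) // w,
            0b011: ceil_div(b + a, w), 0b100: (c - 1) // w,
            0b101: ceil_div(c + a, w), 0b110: (c + b - 1) // w,
            0b111: ceil_div(c + b + a, w)}


def p17_three_layer(w, d):
    t = thresholds(w)
    u = {beta: tb >> 6 for beta, tb in t.items()}
    v = {beta: tb & 63 for beta, tb in t.items()}
    g, h = d >> 6, d & 63

    # Layer 1, g side: z1 and the code (z2, z3, z4), 0 when g equals no u_beta.
    z1 = int(u[1] < g < u[2] or u[3] < g < u[4] or u[5] < g < u[6] or u[7] < g)
    z234 = next((beta for beta in t if g == u[beta]), 0)

    # Layer 1, h side: label (b1, b2, b3) of the interval that contains h.
    starts = sorted({0} | {v[k] for k in ODD} | {v[k] + 1 for k in EVEN if v[k] < 63})
    label = bisect_right(starts, h) - 1
    low = starts[label]
    high = starts[label + 1] - 1 if label + 1 < len(starts) else 63

    # Layer 2: output a1 is 1 when the whole interval meets the bound chosen by z234.
    if z234 == 0:
        a1 = 0
    elif z234 in ODD:
        a1 = int(low >= v[z234])
    else:
        a1 = int(high <= v[z234])

    # Layer 3: combine z1 and a1.
    return z1 | a1


p17_layers_ok = all(
    p17_three_layer(w, d) == (((w * d) >> 17) & 1)
    for w in (1, 37, 100, 128, 200, 255)
    for d in range(4096)
)
print("three-layer model matches bit 17 of the product:", p17_layers_ok)
```

In this model, the second layer sees only the code for g and the interval label for h, which corresponds to the six inputs of the eighth LUT 1680.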

Accordingly, embodiments of the present invention provide a circuit of 6-input Boolean gates for multiplying a given (known) Boolean vector w=(w₇, w₆, w₅, w₄, w₃, w₂, w₁, w₀) by any Boolean input vector d=(d₁₁, d₁₀, d₉, d₈, d₇, d₆, d₅, d₄, d₃, d₂, d₁, d₀). w has a predetermined length, for example, eight bits. d has a predetermined length, for example, twelve bits. By implementing the circuit using LUTs, for example, in an FPGA, an ASIC, or any other electronic circuit or device, embodiments of the present invention improve the efficiency of determining the product, obtaining the result in less time than performing the multiplication computation. The circuit can be used in a variety of computing environments, such as a neural network system, a computing device, a quantum computer, a mainframe computer, a memory controller, or any other type of apparatus that requires computing multiplications, and particularly where one of the numbers in the multiplication is a known value.

Further, embodiments of the present invention use FPGAs that limit each of the LUTs to use at most six inputs. Accordingly, embodiments of the present invention facilitate a practical application of determining a multiplication result and improving the efficiency of such computations performed by present solutions. Embodiments of the present invention, accordingly, provide an improvement to a particular technology, in this case, computing technology. Further yet, embodiments of the present invention facilitate improvements to present solutions such as neural networks and other types of computing systems by improving their efficiency at computing such multiplications, the results of which are used in various applications.

The neural network system 200 can be implemented using a computer system or any other apparatus. Turning now to FIG. 17, a computer system 1700 is generally shown in accordance with an embodiment. The computer system 1700 can be an electronic, computer framework comprising and/or employing any number and combination of computing devices and networks utilizing various communication technologies, as described herein. The computer system 1700 can be easily scalable, extensible, and modular, with the ability to change to different services or reconfigure some features independently of others. The computer system 1700 may be, for example, a server, desktop computer, laptop computer, tablet computer, or smartphone. In some examples, the computer system 1700 may be a cloud computing node. Computer system 1700 may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system 1700 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media, including memory storage devices.

As shown in FIG. 17, the computer system 1700 has one or more central processing units (CPU(s)) 1701 a, 1701 b, 1701 c, etc. (collectively or generically referred to as processor(s) 1701). The processors 1701 can be a single-core processor, multi-core processor, computing cluster, or any number of other configurations. The processors 1701, also referred to as processing circuits, are coupled via a system bus 1702 to a system memory 1703 and various other components. The system memory 1703 can include a read-only memory (ROM) 1704 and a random access memory (RAM) 1705. The ROM 1704 is coupled to the system bus 1702 and may include a basic input/output system (BIOS), which controls certain basic functions of the computer system 1700. The RAM is read-write memory coupled to the system bus 1702 for use by the processors 1701. The system memory 1703 provides temporary memory space for operations of said instructions during operation. The system memory 1703 can include random access memory (RAM), read-only memory, flash memory, or any other suitable memory systems.

The computer system 1700 comprises an input/output (I/O) adapter 1706 and a communications adapter 1707 coupled to the system bus 1702. The I/O adapter 1706 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 1708 and/or any other similar component. The I/O adapter 1706 and the hard disk 1708 are collectively referred to herein as a mass storage 1710.

Software 1711 for execution on the computer system 1700 may be stored in the mass storage 1710. The mass storage 1710 is an example of a tangible storage medium readable by the processors 1701, where the software 1711 is stored as instructions for execution by the processors 1701 to cause the computer system 1700 to operate, such as is described hereinbelow with respect to the various Figures. Examples of computer program products and the execution of such instructions are discussed herein in more detail. The communications adapter 1707 interconnects the system bus 1702 with a network 1712, which may be an outside network, enabling the computer system 1700 to communicate with other such systems. In one embodiment, a portion of the system memory 1703 and the mass storage 1710 collectively store an operating system, which may be any appropriate operating system, such as the z/OS or AIX operating system from IBM Corporation, to coordinate the functions of the various components shown in FIG. 17.

Additional input/output devices are shown as connected to the system bus 1702 via a display adapter 1715 and an interface adapter 1716. In one embodiment, the adapters 1706, 1707, 1715, and 1716 may be connected to one or more I/O buses that are connected to the system bus 1702 via an intermediate bus bridge (not shown). A display 1719 (e.g., a screen or a display monitor) is connected to the system bus 1702 by the display adapter 1715, which may include a graphics controller to improve the performance of graphics intensive applications and a video controller. A keyboard 1721, a mouse 1722, a speaker 1723, etc. can be interconnected to the system bus 1702 via the interface adapter 1716, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit. Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Thus, as configured in FIG. 17, the computer system 1700 includes processing capability in the form of the processors 1701, storage capability including the system memory 1703 and the mass storage 1710, input means such as the keyboard 1721 and the mouse 1722, and output capability including the speaker 1723 and the display 1719.

In some embodiments, the communications adapter 1707 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others. The network 1712 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others. An external computing device may connect to the computer system 1700 through the network 1712. In some examples, an external computing device may be an external web server or a cloud computing node.

It is to be understood that the block diagram of FIG. 17 is not intended to indicate that the computer system 1700 is to include all of the components shown in FIG. 17. Rather, the computer system 1700 can include any appropriate fewer or additional components not illustrated in FIG. 17 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the embodiments described herein with respect to computer system 1700 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application-specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.

The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

Computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source-code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments described herein.

Various embodiments of the invention are described herein with reference to the related drawings. Alternative embodiments of the invention can be devised without departing from the scope of this invention. Various connections and positional relationships (e.g., over, below, adjacent, etc.) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and the present invention is not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.

The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains” or “containing,” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.

Additionally, the term “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc. The term “a plurality” may be understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc. The term “connection” may include both an indirect “connection” and a direct “connection.”

The terms “about,” “substantially,” “approximately,” and variations thereof, are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, “about” can include a range of ±8% or 5% or 2% of a given value.

For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.

What is claimed is:
 1. A method for multiplying two binary numbers, the method comprising: configuring, in an integrated circuit, a plurality of lookup tables based on a known binary number (w), the lookup tables being configured in three layers; receiving, by the integrated circuit, an input binary number (d); and determining, by the integrated circuit, a binary multiplication result (p) of the known binary number w and the input binary number d by determining each bit (p_(i)) from p using the lookup tables, based on a specific combination of bits from the known binary number w and from the input binary number d, wherein the notation j_(x) is binary notation to represent the x^(th) rightmost bit of a binary number j, bit j₀ being the rightmost bit of j.
 2. The method of claim 1, wherein the known binary number w is an 8-bit binary number.
 3. The method of claim 1, wherein the input binary number d has a predetermined number of bits.
 4. The method of claim 3, wherein the input binary number d is a 12-bit binary number.
 5. The method of claim 4, wherein the input binary number d is received from a 12-bit image sensor.
 6. The method of claim 1, wherein bits p₅, p₄, p₃, p₂, p₁, and p₀ of the multiplication result p are determined by a first circuit from the integrated circuit based on bits d₅, d₄, d₃, d₂, d₁, and d₀ of the input binary number d, the first circuit comprising a first one of the layers of the lookup tables.
 7. The method of claim 6, wherein bits p₈, p₇, and p₆ of the multiplication result p are determined by a second circuit from the integrated circuit based on a first set of auxiliary bits computed by the first circuit, the second circuit comprising a second one of the layers of the lookup tables.
 8. The method of claim 7, wherein bits p₁₆, p₁₅, p₁₄, p₁₃, p₁₂, p₁₁, and p₁₀ of the multiplication result p are determined by a third circuit from the integrated circuit based on auxiliary bits computed by the second circuit, the third circuit comprising a third one of the layers of the lookup tables.
 9. The method of claim 8, wherein determining bit p₁₉ of the multiplication result p comprises: determining, by using a subset of the lookup tables, that t≤d, wherein t=⌈2¹⁹/w⌉ is a precomputed threshold, and d is the input binary number; and in response to t≤d, setting p₁₉ to 1, and otherwise setting p₁₉ to 0.
 10. The method of claim 8, wherein determining bit p₁₈ of the multiplication result p comprises: precomputing threshold values t₀₁, t₁₀, and t₁₁ using the known binary number w: t₀₁=⌈2¹⁸/w⌉, t₁₀=⌊(2¹⁹−1)/w⌋, and t₁₁=⌈(2¹⁹+2¹⁸)/w⌉; and in response to (t₁₁≤d) or (t₀₁≤d≤t₁₀), setting the bit p₁₈ to 1, and otherwise setting the bit p₁₈ to 0, d being the input binary number.
 11. The method of claim 8, wherein determining bit p₁₇ of the multiplication result p comprises: precomputing a plurality of threshold values using the known binary number w: t₀₀₁=⌈2¹⁷/w⌉, t₀₁₀=⌊(2¹⁸−1)/w⌋, t₀₁₁=⌈(2¹⁸+2¹⁷)/w⌉, t₁₀₀=⌊(2¹⁹−1)/w⌋, t₁₀₁=⌈(2¹⁹+2¹⁷)/w⌉, t₁₁₀=⌊(2¹⁹+2¹⁸−1)/w⌋, and t₁₁₁=⌈(2¹⁹+2¹⁸+2¹⁷)/w⌉; and setting the bit p₁₇ to 1 in response to a subset of the plurality of threshold values: t₁₁₁≤d, t₁₀₁≤d≤t₁₁₀, t₀₁₁≤d≤t₁₀₀, and t₀₀₁≤d≤t₀₁₀; and setting p₁₇ to 0 otherwise, wherein d is the input binary number.
 12. A system comprising: a memory device that stores a known binary number (w); and a multiplication circuit that is configured to perform a method to determine a multiplication result (p) of the known binary number w with an input binary number (d) that is received dynamically, the method comprising: configuring a plurality of lookup tables in the multiplication circuit based on the known binary number w, wherein the lookup tables are set up in three layers; and determining the multiplication result p of the known binary number w and the input binary number d using the lookup tables, wherein each bit (p_(i)) from the multiplication result p is determined from the lookup tables based on specific combinations of bits from the known binary number w and from the input binary number d, wherein j_(x) is a binary notation that represents the x^(th) rightmost bit of j, with bit j₀ being the rightmost bit of j.
 13. The system of claim 12, wherein bits p₅, p₄, p₃, p₂, p₁, and p₀ of the multiplication result p are determined by a first layer of the lookup tables from the multiplication circuit based on bits d₅, d₄, d₃, d₂, d₁, and d₀ of input binary number d.
 14. The system of claim 13, wherein the bits p₈, p₇, and p₆ of the multiplication result p are determined by a second layer of the lookup tables from the multiplication circuit based on a first set of auxiliary bits computed by the first layer.
 15. The system of claim 14, wherein bits p₁₆, p₁₅, p₁₄, p₁₃, p₁₂, p₁₁, and p₁₀ of the multiplication result p are determined by a third layer of the lookup tables from the multiplication circuit based on auxiliary bits computed by the second layer.
 16. The system of claim 13, wherein determining bit p₁₉ of the multiplication result p comprises: determining, using a subset of lookup tables, that t≤d, wherein t=⌈2¹⁹/w⌉ is a precomputed threshold; and in response to t≤d, setting p₁₉ to 1, and otherwise setting the bit p₁₉ to 0, d being the input binary number.
 17. The system of claim 13, wherein determining bit p₁₈ of the multiplication result p comprises: precomputing a plurality of threshold values t₀₁, t₁₀, and t₁₁ based on the known binary number w: t₀₁=⌈2¹⁸/w⌉, t₁₀=⌊(2¹⁹−1)/w⌋, and t₁₁=⌈(2¹⁹+2¹⁸)/w⌉; and in response to (t₁₁≤d) or (t₀₁≤d≤t₁₀), setting the bit p₁₈ to 1, and otherwise setting the bit p₁₈ to 0, wherein d is the input binary number.
 18. The system of claim 17, wherein determining bit p₁₇ of the multiplication result p comprises: precomputing a plurality of threshold values using the known binary number: t₀₀₁=⌈2¹⁷/w⌉, t₀₁₀=⌊(2¹⁸−1)/w⌋, t₀₁₁=⌈(2¹⁸+2¹⁷)/w⌉, t₁₀₀=⌊(2¹⁹−1)/w⌋, t₁₀₁=⌈(2¹⁹+2¹⁷)/w⌉, t₁₁₀=⌊(2¹⁹+2¹⁸−1)/w⌋, and t₁₁₁=⌈(2¹⁹+2¹⁸+2¹⁷)/w⌉; and setting the bit p₁₇ to 1 in response to a subset of the plurality of threshold values: t₁₁₁≤d, t₁₀₁≤d≤t₁₁₀, t₀₁₁≤d≤t₁₀₀, and t₀₀₁≤d≤t₀₁₀; and setting the bit p₁₇ to 0 otherwise, wherein d is the input binary number.
 19. A neural network system comprising: a multiplication circuit configured to determine a multiplication result (p) of multiplying a weight value (w) that has a known value with an input value (d) that is received dynamically, wherein, p, d, and w are binary numbers, determining the multiplication result p comprises: configuring a plurality of lookup tables in an integrated circuit based on the weight value w that is a known value, the lookup tables are arranged in three layers; and outputting the multiplication result p using the lookup tables, wherein each bit of the multiplication result p is independently determined from the lookup tables based on specific combination of bits from the weight value w and from the input value d, wherein a notation j_(x) represents the x^(th) bit of j from the right, with bit j₀ being the rightmost bit of j.
 20. The neural network system of claim 19, wherein the integrated circuit is a field programmable gate array.
 21. The neural network system of claim 20, wherein each of the lookup tables receives at most 6 input bit-values and outputs a single bit-value in response.
 22. An electronic circuit for determining a multiplication result (p) of a weight value (w) and an input value (d) that is received dynamically, wherein determining the multiplication result comprises: configuring a plurality of lookup tables based on the weight value (w), wherein the lookup tables are configured in three layers; and outputting a respective bit (p_(i)) of the multiplication result (p) using the lookup tables based on specific combination of bits from the weight value w and from the input value d, wherein a notation j_(x) represents the x^(th) bit of j from the right, with bit j₀ being the rightmost bit of j.
 23. The electronic circuit of claim 22, wherein: a first subset of the lookup tables determines bits p₅, p₄, p₃, p₂, p₁, and p₀ of the multiplication result p; a second subset of the lookup tables determines bits p₈, p₇, and p₆ of the multiplication result p; a third subset of the lookup tables determines bits p₁₆, p₁₅, p₁₄, p₁₃, p₁₂, p₁₁, and p₁₀ of the multiplication result p; and a fourth subset of the lookup tables determines bits p₁₉, p₁₈, and p₁₇ of the multiplication result p.
 24. A device comprising: a field programmable gate array that includes a plurality of lookup tables configured in three layers to realize a multiplication circuit for two binary numbers w and d, wherein a binary multiplication result (p) of the binary numbers w and d is output by outputting each bit (p_(i)) from p using the lookup tables based on a specific combination of bits from the binary number w and from the binary number d.
 25. The field programmable gate array of claim 24, wherein each of the lookup tables includes at most 6 inputs.
 26. A method comprising: using the device of claim 24 to multiply (i) a 12-bit variable input binary number d by (ii) an 8-bit known value binary number w, wherein a binary multiplication result (p) is determined by determining each bit (p_(i)) from p using the lookup tables based on a specific combination of bits from the binary number w and from the binary number d.
 27. The method of claim 26, wherein the input binary number is obtained from a 12-bit sensor.
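
By way of a non-limiting illustration, the threshold comparisons recited above for bits p₁₉, p₁₈, and p₁₇, and the dependence of bits p₅ through p₀ on bits d₅ through d₀ only, can be sketched in software as follows. The sketch assumes a nonzero 8-bit w and a 12-bit d, treats the listed threshold conditions as alternatives (any one of them suffices), and models each lookup table as a 64-entry Python list. The names ceil_div, high_bits, build_low_luts, and low_bits are hypothetical and do not appear in the disclosure. The sketch models the arithmetic only and is not a description of the FPGA configuration itself.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, using integer arithmetic only."""
    return -(-a // b)


def high_bits(w, d):
    """Return (p19, p18, p17) of w * d using only threshold comparisons on d.

    Assumption of this sketch: w is a nonzero 8-bit value, d is a 12-bit value.
    """
    assert 1 <= w < 2**8 and 0 <= d < 2**12

    # p19: a single threshold t = ceil(2^19 / w).
    t = ceil_div(2**19, w)
    p19 = int(t <= d)

    # p18: three thresholds; the bit is set inside [t01, t10] or at/above t11.
    t01 = ceil_div(2**18, w)
    t10 = (2**19 - 1) // w
    t11 = ceil_div(2**19 + 2**18, w)
    p18 = int(t11 <= d or t01 <= d <= t10)

    # p17: seven thresholds delimiting four intervals of d.
    t001 = ceil_div(2**17, w)
    t010 = (2**18 - 1) // w
    t011 = ceil_div(2**18 + 2**17, w)
    t100 = (2**19 - 1) // w
    t101 = ceil_div(2**19 + 2**17, w)
    t110 = (2**19 + 2**18 - 1) // w
    t111 = ceil_div(2**19 + 2**18 + 2**17, w)
    p17 = int(
        t111 <= d
        or t101 <= d <= t110
        or t011 <= d <= t100
        or t001 <= d <= t010
    )
    return p19, p18, p17


def build_low_luts(w):
    """Precompute six 64-entry tables, one per output bit p0..p5, for a fixed w.

    Each table is indexed by the six low bits of d, mirroring a 6-input LUT.
    """
    return [[((w * x) >> i) & 1 for x in range(64)] for i in range(6)]


def low_bits(luts, d):
    """Return [p0, p1, ..., p5] by table lookup on bits d5..d0 only."""
    x = d & 0x3F  # keep d5..d0
    return [lut[x] for lut in luts]


if __name__ == "__main__":
    w, d = 200, 4095
    print(high_bits(w, d), low_bits(build_low_luts(w), d))
```

The low-bit tables work because the six least significant bits of w·d depend only on the six least significant bits of d, so each of p₅ through p₀ is a function of at most six input bits once w is fixed.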